0.1 List of Tables

-Table @ref(Community Health Outcome Summary Table)
-Table 3:Community Health Z Score Summary Table
-Table 4:Summary table of county-level explanatory socioeconomic variables in Pennsylvania (Code line 559)

Figure @ref(fig:Coal Production plotted over time at each Pennsylvania County)=“Coal Production plotted over time at each Pennsylvania County”) Figure @ref(fig.cap=“Coal Production plotted over time at each Pennsylvania County”) -Figure @ref(fig.cap=“Map of Z scores Health Outcome by County in 2019.”) #mapview may not show up -Figure @ref(fig.cap=“Map of Health Outcome Ranks by County in 2019”) #mapview may not show up -Figure @ref(fig.cap=“Map of Z scores Health Outcome by County in 2010”) #mapview may not show up -Figure @ref(fig.cap=“Map of Health outcomes ranking by county in 2010”) #mapview may not show up -Figure @ref(fig.align=‘left’, fig.cap=“Graph depicting median age, median income, total population, percent of population with a bachelor’s degree and percent of population which is white-identifying for Pennsylvania counties in 2019. Data sourced from ACS 1-year estimates.”) -Figure @ref(fig.cap=“Synchronous plot of health outcomes and health factors z scores over time across all counties”) -Figure @ref(fig.cap=“Plot of coal production over time across all counties”) -Figure @ref(fig.cap=“Synchronous plot of coal production and health factors z score over time in the Schuylkill county”) -Figure @ref(fig.cap=“Synchronous plot of coal production and health outcomes z score over time in the Schuylkill county”) -Figure @ref( fig.cap=“Synchronous plot of coal production and health factors z score over time in the Armstrong county”) -Figure @ref(fig.cap=“Synchronous plot of coal production and health factors z score over time in the Clearfield county”) -Figure @ref(fig.cap=“Synchronous plot of coal production and health outcomes z score over time in the Clearfield county”) -Figure @ref(fig.cap=“Mixed correlation plot for all variables of interest. Blue values are positive and red values are negative. Correlations performed using Pearson’s correlation coefficient.”) -Figure @ref(fig.cap=“Initial regression results including all socioeconomic and coal production variables regressed on health outcomes z-score. Adj. R-squared = 0.5189, 14 degrees of freedom.”) -Figure @ref(fig.cap=“Akaike Information Criteria (AIC) stepwise results”, fig.height=2.5) -Figure @ref(fig.cap=“Second linear regression following AIC value recommendations for reduced overfitting. Adjusted R-squared: 0.7322, 33 degrees of freedom”) -Figure @ref(fig.cap=“Plots to examine multiple linear regression assumptions (homoscedasticity, normality)”) -Figure @ref(fig.cap=“Plots to examine linearity assumption of multiple linear regression model.”) -Figure @ref(fig.cap=“SO4 Levels Seen in Acid Mine Drainage in Pennsylvania”)

0.2 Introduction

Coal-fired power plants have well-documented negative impacts on community health. (Mueller, 2022). The impact of coal mining, however, has been less closely examined. Mining activities release coal dust, ashes, metals, and aromatic hydrocarbons into the air, which can be a health hazard for workers. This can lead to elevated levels of silicon and aluminum in the body (G. leno-Meija et al, 2014). In addition, research studies have shown that coal mine workers have health effects such as black lung disease(Gasparotto, 2021).

The impact of coal production on community health is particularly important for the state of Pennsylvania, the country’s fourth coal producer. Pennsylvania has a long history of coal production, starting in the 1700s(PA DEQ). This coal production drove population growth and infrastructure throughout the state. Coal also drove Pennsylvania’s successful steel industry. Today, however, these counties have been largely abandoned by offshoring of industrial production, leaving much of Pennsylvania in the ‘rust belt’ - the swath of America once driven by manufacturing but now reliant on dwindling economic resources and home to increasing numbers of community health challenges.

This study examines the impacts of historical coal mining on community health in Pennsylvania. We more specifically examine both the impact of total coal production and the impact of temporal distance from peak coal production on county-level health outcomes in the state. By understanding the influence of coal mining on community health, we hope to be able to more effectively target community health interventions to counties most in need.

We examine this question through the use of four distinct county-level datasets: coal production data from the U.S. Energy Informational Administration (USIEA), census data from the American Community Survey (ACS), county health rankings data from the University of Wisconsin’s Community Health Rankings and Roadmap program, and Water quality data from the Water Quality Portal, a partnership between United States Geological Survey and the Environmental Protection Agency. These datasets collectively covered a broad enough time series to observe impacts in coal production over time. All focus on the state of Pennsylvania.

0.3 Rationale and Research Questions

  1. Has coal production impacted local community health in Pennsylvania?
  2. How do the demographics of Pennsylvania counties vary, and how are those demographics associated with community health?
  3. How do coal production and community health vary over time? Are they linked temporally or spatially?

0.4 Dataset Information

A summary of the four datasets and the individual data wrangling process is described below.

0.4.1 Coal Mining Data

Our coal mining dataset provides data on annual coal production by Pennsylvania county. These data were downloaded from the U.S. Energy Information Administration website (https://www.eia.gov/coal/data.php) and cover the years from 1983 until 2021. The data was submitted across the united states by each individual mining operation and includes extensive information from the union the mine is organized under to the number of employees. It is submitted through the annual survey of coal production and communication, a mandated survey.

The wrangling process can be seen step by step below, but a summary is as follows: The coal data for each individual year was uploaded using the read.csv function. Then, the class of different columns were adjusted so that they were consistent across years and could be merged. At this point, they were merged with the full_join command such that one dataframe was formed. This new dataset was then filtered to only include data from Pennsylvania and then slimmed to only include the columns of interest (county, year, status, and production in tons). At this point, we needed to merge the columns where multiple mines were operating within a single county. This was done by converting the production to class integer, grouping by county and year, and summing the coal production within those groups. The weighted average year of coal production for each county was then calculated by summing the product of coal produced and year for each year and dividing by the total sum of coal production within each county. Finally, a column expressing the sum of coal production across all years for each county was added. At this point, this more refined dataset was saved to the processed data folder for future analysis.

Coal Wrangling Code

Coal Summary Table The summary table below provides the range, median, and interquartile range for each of the parameters included.

Summary Statistics of County-Level Coal Production Variables in Pennsylvania
Variable Observations Median Mean SD Range Minimum Maximum
Coal Production per Year 1,150 155,322.00 386,138.67 531,178.11 2,931,698.00 555.00 2,932,253.00
Weighted Year of Peak Production 1,150 2,001.60 2,002.11 4.14 31.94 1,989.06 2,021.00
Total Coal Produced 1,150 6,959,278.00 14,782,247.15 18,932,716.79 80,313,637.00 60,828.00 80,374,465.00

0.4.2 Water Quality Data

As a tangential thought to begin understanding the possible linkage between coal mining and community health, we investigated water quality in the surrounding mines. Due to struggles to find a thorough account of water quality in Pennsylvania, this was not a core dataset used and so it is not integrated into our main analysis. Still, it is mentioned within the conclusions and so we have included the wrangling process utilized here. The data comes from the Water Quality Portal, a partnership between United States Geological Survey and the Environmental Protection Agency. It is the conglomeration of US water quality data from multiple databases. The data utilized here primarily came from the STORET warehouse.

The water quality data was uploaded as well as the collection site data. These two datasets were then merged together by the unifying monitoring location identifier. We could then filter for only water quality measurements collected in Pennsylvania, the focus of this study, and condense the data to the components of interest. Finally, we filtered for only sulfate (as SO4) readings as that is the angle we investigate within the conclusions.

0.4.3 Community Health Indicators

Community Health data was pulled from Community Health Rankings and Roadmap, a program of the University of Wisconsin Population Health Institute. This health rankings model was developed by the University and aims to represent how healthy a county is. And this dataset contains various indicators that have been composited into a few different summary variables. These composite variables include Health Outcomes, Health factors.

Health outcomes represent the county’s level of health. This level of health is calculated by classifying a variety of health outcomes - such as morbidity, life expectancy, birth weight, premature death etc. - into ‘length of life’ and ‘quality of life’ groupings. These groupings are then aggregated into a single ‘health outcomes’ score for the county in the given year. Health factors in contrast produce a ranking based on variables which determine the outcomes listed above, such as access to clinical care or socioeconomic variables. You can learn more about the halth ranking model at:(https://www.countyhealthrankings.org/explore-health-rankings/county-health-rankings-model). The scores from both health outcomes and health factors are output as either ranks. A rank of 1 indicates the halthiest county with decreasing health as the number increases. The Z scores provide a continuous variable that represents the distance of the given county from the mean health outcome/factor value for all of counties. A decreasing z-score indicates improving health of the county.

The datasets for Pennsylvania were downloaded from the community Health Rankings website for years 2010-2020. The data set consists of one value for each indicator by county per year. Each individual year was downloaded and then combined through the rbind function in R. Only the columns of interest were selected to create a health data frame for further analysis. The classification of each column was changed to allow for plotting and other visual representations. The wrangled health data frame includes health outcomes Z score, health outcomes rank, health factors z score, and health factors rank by county from 2010-2020 by county. This data frame is saved in the processed data folder.

Community Health Data Wrangling Code

A spatial shapefile by county was also utilized to display this information. This shapefile developed by the U.S. Census is their cartographic boundary shapefile from 2020. This dataset includes the entire United States, so the state of Pennsylvania was filtered out to be combined with the health data frame. This allows community health indicators to be visualized spatially. To compare coal and non coal producing counties, an additional dataframe was created that filtered out only coal producing counties. Below is a summary tables capturing important variables in the health data frame.

Spatial Wrangling Code

## Reading layer `cb_2018_us_county_20m' from data source 
##   `/Users/hughcipparone/Documents/Nic School Fall 2022/V3_Cipparone_Niva_OCallaghan_EDA_Finalproject/RawData/cb_2018_us_county_20m.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 3220 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
## Geodetic CRS:  NAD83
##      xmin      ymin      xmax      ymax 
## -80.51989  39.72006 -74.69491  42.26986
## [1] "character"
## [1] 1

Summary tables describing the dimensions of the health data set are below. It should be noted that the variables are ranks and Z scores and therefore the dimensions of the dataset is simple.

Community Health Outcome Summary Table

healthdf <- read.csv("./ProcessedData/healthdf.csv", stringsAsFactors = TRUE)

healthdf.table <- healthdf %>%
    group_by(year) %>%
    summarize(mean_Health_Outcome_rank = mean(Health.Outcomes.Rank), min_health_outcome_rank = min(mean_Health_Outcome_rank),
        max_Health_Outcome_rank = max(Health.Outcomes.Rank), sd_Health_Outcome_rank = sd(Health.Outcomes.Rank))

kable(x = healthdf.table, format = "html", col.names = c("Year", "Mean Rank",
    "Min Rank", "Max Rank", "SD Rank"), caption = "Health Indicators Outcomes Summary Table") %>%
    kable_classic(full_width = F, html_font = "Cambria")
Health Indicators Outcomes Summary Table
Year Mean Rank Min Rank Max Rank SD Rank
2010 34 34 67 19.48504
2011 34 34 67 19.48504
2012 34 34 67 19.48504
2013 34 34 67 19.48504
2014 34 34 67 19.48504
2015 34 34 67 19.48504
2016 34 34 67 19.48504
2017 34 34 67 19.48504
2018 34 34 67 19.48504
2019 34 34 67 19.48504
2020 34 34 67 19.48504
# kable(healthdf.table,caption = 'Health Indicators Outcomes Summary
# Table')
Community Health Z Score Summary Table
Health Indicators Z Score Summary Table
Year Mean Z Min Z Max Z SD Z
2010 0.0002985 0.0002985 3.43 0.7756170
2011 -0.0001493 -0.0001493 3.28 0.7681984
2012 -0.0002985 -0.0002985 3.20 0.7843545
2013 -0.0010448 -0.0010448 2.62 0.7817117
2014 -0.0022388 -0.0022388 2.56 0.8032487
2015 -0.0200000 -0.0200000 1.99 0.7159418
2016 -0.0111940 -0.0111940 3.01 0.7790992
2017 0.0001493 0.0001493 2.80 0.7808688
2018 0.0004478 0.0004478 2.54 0.7874555
2019 -0.0002985 -0.0002985 2.38 0.7793936
2020 -0.0005970 -0.0005970 2.22 0.7830242

0.4.4 American Community Survey

The third dataset - the ACS dataset - comes from the American Community Survey. This survey has been run by the US Census Bureau once a year since 2005, and provides information on a wide range of socioeconomic variables at multiple spatial scales, including the county level. We chose to use the single-year estimates (other options are estimates for socioeconomic variables extending over five or ten year periods) to match our single year data for coal production over time. You can learn more at: https://www.census.gov/programs-surveys/acs

We collected these data through the tidycensus R package, which pulls the data using an Application Programming Interface (API) (https://walker-data.com/tidycensus/). We specifically pulled data on the county level within Pennsylvania for five variables: total population, total white population, median income, median age and population with a bachelor’s degree. We selected these five variables due to the well documented importance of income, education, and age on human health (https://pubmed.ncbi.nlm.nih.gov/8820314/, https://www.cdc.gov/minorityhealth/racism-disparities/index.html) and therefore their potential role as confounding variables in the relationship between coal production and community health.

We wrangled these data using the tidyverse package to capture the percentage of the county which was white-identifying - rather than the total population of white people - and the percentage of the county with a bachelor’s degree - rather than the total population. Each of these modifications permitted us to compare race and education rates in counties with different total populations. We also removed any counties which reported populations which had zero white people, assuming that these data points could not be accurate in the majority-white US. We only used 2019 socioeconomic variables in the multiple linear regression at the heart of this project.

American Community Import and Wrangle

American Community Summary Table The summary statistics table below captures a number of important characteristics of the collected variables from the year 2019:

Table 1: County-level Summary Statistics of Socioeconomic Variables in Pennsylvania
Variable Observations Median Mean SD Range Minimum Maximum
X 40 20.50 20.50 11.69 39.00 1.00 40.00
Total Population 40 167,100.00 296,177.88 316,664.55 1,519,882.00 64,182.00 1,584,064.00
Median Age 40 42.30 42.47 3.16 14.20 33.40 47.60
Median Income (USD) 40 30,676.00 31,922.28 5,099.44 21,142.00 25,728.00 46,870.00
% White-Identifying 40 87.48 80.07 25.28 95.94 0.00 95.94
% with Bachelor's Degree 40 11.52 12.17 3.50 14.12 7.78 21.90
year 40 2,019.00 2,019.00 0.00 0.00 2,019.00 2,019.00

0.4.5 Combined Datasets

After collecting each of our individual datasets, we created two combined datasets - one for the multiple linear regression and another for the time series analysis examining the relationship between health outcomes and coal production over time.

We created the dataset for the multiple linear regression by creating a coal dataset for all of the coal counties with only the total production over time and the calculation of the number of years from peak production to 2019. This was grouped by county (previous calculations for these values had already occurred during initial wrangling of the coal dataset). This was then left joined to the health dataset which had been filtered to only include health values for 2019. This retained all of the counties included in the health dataset while integrating the production values for the counties which produced coal. We then imputed values of ‘0’ total coal production for the counties with NAs in that field - NAs produced by the fact that they weren’t in the coal dataset and therefore had not seen any coal production over time. We then imputed values of ‘37’ for the years from peak production to 2019, as the coal dataset ran from 1983. Thirty-seven years permitted us to use these counties in the multiple linear regression, placed these non-coal producing counties as counties with long temporal distance from the present - something you would assume would be associated with presumably healthy counties given their low coal production, gave us a value with reason - 37 years is the temporal distance from 2019 to 1982, the year before we have any data - and gave us a value that wouldn’t extensively skew the data. Finally we left joined this dataset to our ACS data, and subtracted any counties with ‘NAs’ and any counties with ’0’s for total white population, as we assumed that these data were incorrect given the racial composition of the US and the other counties (all of whose percentage of white people were around 70% with the sole exception of Philadelphia). This left us with a sample size of 37 counties. This data set is saved to the processed data folder.

Multiple Linear Regression Wrangling

We created the dataset for the time series analysis by utilizing data from the aggCoalData csv and the health.data csv previously saved and covered above. First, we read the dataframes into r. We then had to rename the year column for the health data frame so that it had the same title as the year column for the aggCoalData. At that point, we were able to merge (using the two datasets) the two data sets by county and year and select the columns necessary for the analysis (Year, county, health outcomes z score, health factors z score, and coal production). Note that this only included the counties that produced coal at some point but as we were looking for a correlation between coal production and health z scores over time, we did not need to factor in the counties with zero coal production. Also note that this dataset only covered the years from 2010 until 2020. These are the only years that the health data was available for and so the only years pertinent to include in this dataset. At this point, we had a dataframe with concurrent measures of coal production and health z scores for each coal producing county in the year span of 2010 to 2020.

Time Series Analysis Wrangling

0.5 Exploratory Analysis

Exploratory analysis was conducted on the individual data to trends for each different data categories independently.

0.5.1 Coal Production

Coal production over time

CoalOverTime <- ggplot(aggCoalSlim, aes(Year, sum.coal.production, group = County,
    color = County)) + geom_smooth() + ylab("Total Coal Production (Tons)") +
    ggtitle("Coal Production over Time in each Pennsylvania County")

print(CoalOverTime)
Coal Production plotted over time at each Pennsylvania County

Coal Production plotted over time at each Pennsylvania County

The plot over time shows how the coal production has changed over time from 1983 until today in each county. The more recent petering off of most of the mines makes sense as most of the available coal has been removed and the US energy sources have moved away from coal. There is a pretty strong bell curve in most mines which would align with the location of a good mine, the mass export of the mined coal, and then the decrease in production as the most readily available coal is gone. It is interesting to see how all the counties seem to steeply increase in coal production from 1983 on. This could be a limitation in the dataset (only giving strong coverage of mines beginning operation in 1983) but none of that was explained or visible in the metadata and so it is more likely that there was an uptick in coal production at that point. This could be due to the implementation of longwall mining which first began in the 1960s (https://www.dep.pa.gov/Business/Land/Mining/Pages/PA-Mining-History.aspx). It is also clearly visible here that Schuylkill County has far more coal production than any other county.

0.5.2 Community Health Indicators

A spatial analysis was utilized to display health outcomes rank and Z score by county for 2019 and 2010. 2010 was chosen because it is the first year of available data, and 2019 was chosen because it is the year utilized for the multilinear regression. A side-by-side map view is shown for all the health visualizations to compare all counties on the left pane with only coal-producing counties on the right pane. Rankings range from 1-67 because there are 67 counties in Pennsylvania. The dataset from EIA indicated that 37 of these 67 counties have, at some point, produced coal. These 37 counties were subset for the the comparison. Additionally, counties on the map can be hovered over to see the indicator value mapped and the specific county information.

# 2019 Z score
all.Z.2019 <- mapview(Health.2019, zcol = "Health.Outcomes.Z.Score", col.regions = brewer.pal(10,
    "OrRd"))


coal.z.2019 <- mapview(coal.2019, zcol = "Health.Outcomes.Z.Score", col.regions = brewer.pal(10,
    "OrRd"))

sync(all.Z.2019, coal.z.2019)

This figure shows that most of the counties with a high Z score, darker orange indicating poor health, are shown on the right pannel indicating coal producing counties. There are a few exceptions, for example, the County of Philadelphia, which also includes the City of Philadelphia, the largest city in the State of Pennsylvania (source). The other major exception is the County of Centre, the county shown middle, which had relatively low total coal production (2,543,689 tons) compared to neighboring counties.

# (Caption figure XX: Health outcomes ranking by county for 2019.)

# 2019 Rank
all.rank.2019 <- mapview(Health.2019, zcol = "Health.Outcomes.Rank", col.regions = brewer.pal(20,
    "Blues"))

coal.rank.2019 <- mapview(coal.2019, zcol = "Health.Outcomes.Rank", col.regions = brewer.pal(20,
    "Blues"))

sync(all.rank.2019, coal.rank.2019)

This figure shows the heather counties in a lighter blue county and decreasing health towards the darker blue. A seven coal-producing counties with relatively low health rankings below 20 are Bulter, Bedford, Clinton, Centre, Clinton, Montogomery and Tiago. These are the exceptions, as most health rankings for coal-producing counties have a poor health ranking visualized by the darker blue colors. Keep in mind this graph represents counties that have had any amount of coal production.

# (Caption figure XX: Shows Z scores of health outcome by County in 2010)

# 2010 Z score
all.z.2010 <- mapview(Health.2010, zcol = "Health.Outcomes.Z.Score", col.regions = brewer.pal(5,
    "OrRd"))

coal.z.2010 <- mapview(coal.2010, zcol = "Health.Outcomes.Z.Score", col.regions = brewer.pal(5,
    "OrRd"))

# showing maps side by side
sync(all.z.2010, coal.z.2010)

The same analysis as the maps above was conducted for the year 2010. Looking at the left pane, there is one county stands out. This county is Philadelphia, which has a Z score in 2010 of 3.43. The next highest Z score for that year is 1.21 in Greene County, a coal-producing county. This county is displayed at the bottom left on the Pennsylvania map. The scale of the right pane is slightly different due to Philadelphia County not being included. A few counties with a history of coal production have a low Z score and, therefore, better health. Centre county has the lowest Z score of -1.53, the county with the lightest orange color in the right pane. Butler county is the next healthiest with a z score of -0.71. As you can see from the visualization in this figure, most counties in 2010 with the highest z score indicating poor health are counties that have a history of local production.

# (Caption figure XX: Health outcomes ranking by county in 2010.)

# 2010 Rank
all.rank.2010 <- mapview(Health.2010, zcol = "Health.Outcomes.Rank", col.regions = brewer.pal(20,
    "Blues"))

coal.rank.2010 <- mapview(coal.2010, zcol = "Health.Outcomes.Rank", col.regions = brewer.pal(20,
    "Blues"))

sync(all.rank.2010, coal.rank.2010)

Similar patterns emerge in the 2010 ranking as they did in the 2019 map. In 2010, eight out of the 37 coal-producing counties had health rankings below 20. These are Centre, Bulter, Columbia, Indiana, Montgomery, Berks, Westmoreland, and Jefferson. Rankings of counties with the poorest health 50-67 are composed mostly of coal-producing counties except Philadelphia ranked 67, Wayne ranked 62, Forest ranked 53 and McKean ranked 50.

These maps provide a clear visualization that counties with a history of coal production counties are typically in poor health when examining these community health indicators.

0.5.3 American Community Survey

Graphic of median age, income, pop etc. for all PA counties

# create large color-blind friendly color palette Define the number of
# colors you want
nb.cols <- 40
mycolors <- colorRampPalette(brewer.pal(9, "Set3"))(nb.cols)

# Create individual bar charts for the variables of interest (population,
# race, income etc.) for each of the counties

pop <- ggplot(pa.2019.clean, aes(County, totpop, fill = County)) + geom_bar(stat = "identity") +
    ylab("Total Population") + xlab("County") + theme(axis.text.x = element_text(angle = 90)) +
    scale_fill_manual(values = mycolors) + theme(legend.position = "none")

race <- ggplot(pa.2019.clean, aes(County, perc.white, fill = County)) + geom_bar(stat = "identity") +
    ylab("% White") + xlab("County") + theme(axis.text.x = element_text(angle = 90)) +
    scale_fill_manual(values = mycolors) + theme(legend.position = "none")

income <- ggplot(pa.2019.clean, aes(County, medincome, fill = County)) + geom_bar(stat = "identity") +
    ylab("Median Income") + xlab("County") + theme(axis.text.x = element_text(angle = 90)) +
    scale_fill_manual(values = mycolors) + theme(legend.position = "none")

bachdegree <- ggplot(pa.2019.clean, aes(County, perc.bachdegree, fill = County)) +
    geom_bar(stat = "identity") + ylab("% with Bachelor's Degree") + xlab("County") +
    theme(axis.text.x = element_text(angle = 90)) + scale_fill_manual(values = mycolors)

age <- ggplot(pa.2019.clean, aes(County, medage, fill = County)) + geom_bar(stat = "identity") +
    ylab("Median Age") + xlab("County") + theme(axis.text.x = element_text(angle = 90)) +
    scale_fill_manual(values = mycolors) + theme(legend.position = "none")

# Merge all of the plots into a single figure using a cowplot

ggarrange(pop, race, income, bachdegree, age, align = "hv", labels = c("A",
    "B", "C", "D", "E"), common.legend = T, legend = "bottom", nrows = 2)
Graph depicting median age, median income, total population, percent of population with a bachelor’s degree and percent of population which is white-identifying for Pennsylvania counties in 2019. Data sourced from ACS 1-year estimates.

Graph depicting median age, median income, total population, percent of population with a bachelor’s degree and percent of population which is white-identifying for Pennsylvania counties in 2019. Data sourced from ACS 1-year estimates.

We can observe a number of interesting trends in the above graphic. Looking at total population, we can see that most counties have fewer than 500,000 people. The exceptions to this trend are Philadelphia and Allegheny counties - the counties with the large cities of Philadelphia and Pittsburgh, respectively. Considering race, 75% or more of the populations of most counties identify as white - a notable distinction is Philadelphia. Median income shows that most counties have median incomes of 25,000 or more. There are a few high outliers - specifically Chester and Montgomery counties. Turning to education, we can see that income and the percent of population with a bachelor’s tracks reasonably well - the highest income counties (Chester and Montgomery) also have the highest education levels. Looking at age we can see that most counties have median ages around 40, but Philadelphia and Centre have median ages closer to 30. Considered in the context of national statistics, we can see from this graph that most Pennsylvania counties are a little older, a little poorer, and a little less educated than national population averages. We can also guess that much of Pennsylvania is rural, given the drastic difference in population size and median age between the population hub of Philadelphia and all other counties in the state.

0.6 Analysis

To further understand the relationship of coal production impact on community health in Pennsylvania we conducted two different in depth analyses. A time series and analysis and a multiple linear regression. Additionally, we did a preliminary look at water quality parameters to look at trends in Pennsylvania. An indepth look at each of these analyses are described below.

0.6.1 Time Series Analysis

We first examined the relationship between health outcomes and coal production between 2010 and 2020. While this permits us to investigate the relationship between our two variables of greatest interest, it is important to note that many of the contamination concerns associated with mining occur in the years following coal mining (water contamination from acid mine drainage).

Time Series Data Analysis In our analysis, we look at a few counties individually. When examined all together, the considerable variability across counties causes high error margins to obscure any observable trends.

ggplot(full.time.series.data) + geom_smooth(aes(x = Year, y = Health.Factors.Z.Score,
    color = "Health Factors Z Score", )) + geom_smooth(aes(x = Year, y = Health.Outcomes.Z.Score,
    color = "Health Outcomes Z Score")) + ylab("Health Z Scores")
Synchronous plot of health outcomes and health factors z scores over time across all counties

Synchronous plot of health outcomes and health factors z scores over time across all counties

In this graph, the health outcomes and health factors z scores are presented only from coal producing counties over time. They range from below -1 to above 1, with a somewhat upwards sloping trend over time but with error bars so great this trend is not definite. In contrast, mapping coal production over time shows a decrease over time with a similarly large error range.

ggplot(full.time.series.data) + geom_smooth(aes(x = Year, y = sum.coal.production))
Plot of coal production over time across all counties

Plot of coal production over time across all counties

We looked at how strong this trend across counties was by completing a linear regression.

## 
## Call:
## lm(formula = Health.Factors.Z.Score ~ sum.coal.production, data = full.time.series.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30188 -0.11211  0.05188  0.19390  1.06354 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.444e-03  2.833e-02   0.263    0.793    
## sum.coal.production 1.792e-07  3.999e-08   4.480 1.06e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4007 on 302 degrees of freedom
## Multiple R-squared:  0.06232,    Adjusted R-squared:  0.05921 
## F-statistic: 20.07 on 1 and 302 DF,  p-value: 1.06e-05
## 
## Call:
## lm(formula = Health.Outcomes.Z.Score ~ sum.coal.production, data = full.time.series.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76160 -0.37804  0.04129  0.36571  1.63013 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)         1.271e-01  4.516e-02   2.814  0.00521 **
## sum.coal.production 1.799e-07  6.375e-08   2.822  0.00510 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6387 on 302 degrees of freedom
## Multiple R-squared:  0.02568,    Adjusted R-squared:  0.02246 
## F-statistic: 7.961 on 1 and 302 DF,  p-value: 0.005096

Both p-values are exceedingly small and below 0.05, designating significant trends. The trends do show decreasing public health corresponding with decreasing coal production which is a counterintuitive result. This does make sense when considering how delayed most of the community impacts of coal mining is. Water contamination occurs and continues for years following the closure of a mine and so the health impacts are not immediately felt by the community. There may have also been other factors such as policy and social service changes that decreased health levels in these counties. Still, we did a further analysis of some specific counties with considerably high coal output to see if any trends revealed themselves in a more localized investigation. The first county we looked at was Schuylkill County. This county holds the most weight as the peak coal production far exceeded that seen in any other county (can be seen as the highest arch in the plot within the coal data set analysis above).

Schuykill County

# comparing health factors z score with coal production in schuylkill
# county
x <- data.frame(Year = c(2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018,
    2019, 2020))
Year <- schuylkill.data$year
var1 <- schuylkill.data$Health.Factors.Z.Score
var2 <- schuylkill.data$sum.coal.production
data <- data.frame(x, var1, var2)

obj1 <- xyplot(var1 ~ Year, data, type = "l", lwd = 2, col = "steelblue", ylab = "Health Factors Z Score (Blue)",
    col.axis = "red")
obj2 <- xyplot(var2 ~ Year, data, type = "l", lwd = 2, col = "red", ylab = "Coal Produced in Tons (Red)")

doubleYScale(obj1, obj2, add.ylab2 = TRUE, use.style = FALSE)
Synchronous plot of coal production and health factors z score over time in the Schuylkill county

Synchronous plot of coal production and health factors z score over time in the Schuylkill county

# linear regression for health factors z score and coal production

schuylkill.health.factors.regression <- lm(data = schuylkill.data, Health.Factors.Z.Score ~
    sum.coal.production)

summary(schuylkill.health.factors.regression)
## 
## Call:
## lm(formula = Health.Factors.Z.Score ~ sum.coal.production, data = schuylkill.data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.059895 -0.038608  0.008003  0.038130  0.071284 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -2.175e-01  1.117e-01  -1.947 0.083390 .  
## sum.coal.production  2.348e-07  4.798e-08   4.893 0.000856 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04805 on 9 degrees of freedom
## Multiple R-squared:  0.7268, Adjusted R-squared:  0.6964 
## F-statistic: 23.94 on 1 and 9 DF,  p-value: 0.0008558
# comparing health outcomes z score in schuylkill county
var3 <- schuylkill.data$Health.Outcomes.Z.Score
data <- data.frame(x, var3, var2)

obj1 <- xyplot(var3 ~ Year, data, type = "l", lwd = 2, col = "steelblue", ylab = "Health Outcomes Z Score (Blue)",
    col.axis = "red")
obj2 <- xyplot(var2 ~ Year, data, type = "l", lwd = 2, col = "red", ylab = "Coal Produced in Tons (Red)")

doubleYScale(obj1, obj2, add.ylab2 = TRUE, use.style = FALSE)
Synchronous plot of coal production and health outcomes z score over time in the Schuylkill county

Synchronous plot of coal production and health outcomes z score over time in the Schuylkill county

# linear regression for health outcomes z score and coal production

schuylkill.health.outcomes.regression <- lm(data = schuylkill.data, Health.Outcomes.Z.Score ~
    sum.coal.production)

summary(schuylkill.health.outcomes.regression)
## 
## Call:
## lm(formula = Health.Outcomes.Z.Score ~ sum.coal.production, data = schuylkill.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29634 -0.11627  0.01214  0.13328  0.23626 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          1.261e+00  4.506e-01   2.798   0.0208 *
## sum.coal.production -2.564e-07  1.935e-07  -1.325   0.2179  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1938 on 9 degrees of freedom
## Multiple R-squared:  0.1632, Adjusted R-squared:  0.07023 
## F-statistic: 1.755 on 1 and 9 DF,  p-value: 0.2179

The correlation between the health factors z score and coal production over time in Schuylkill county was the most compelling of the time series analysis. There is a clear correlation visible in the line graph where decreasing coal production is clearly associated with decreasing health factors z score (an indicator of increasing access to public health resources). The linear regression below revealed a p-value of 0.000856, far lower than 0.05, indicating a significantly non-zero relationship. Since we are limited to only 11 data points, this could be coincidental, but considering the extreme output of coal from this region, this may be magnifying any adverse effects of coal production and making this trend more visible. The correlation between health outcomes and coal production was less visible. With a p-value of 0.21, there is likely some correlation, but it is still considerably higher than 0.05.

We then examined Armstrong county and Clearfield County as two other counties with considerably high coal production to understand how pervasive the correlations previously identified are.

Armstrong County

# comparing health factors z score with coal production in Armstrong
# county
x <- data.frame(Year = c(2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018,
    2019, 2020))
Year <- Armstrong.data$year
var1 <- Armstrong.data$Health.Factors.Z.Score
var2 <- Armstrong.data$sum.coal.production
data <- data.frame(x, var1, var2)

obj1 <- xyplot(var1 ~ Year, data, type = "l", lwd = 2, col = "steelblue", ylab = "Health Factors Z Score (Blue)",
    col.axis = "red")
obj2 <- xyplot(var2 ~ Year, data, type = "l", lwd = 2, col = "red", ylab = "Coal Produced in Tons (Red)")

doubleYScale(obj1, obj2, add.ylab2 = TRUE, use.style = FALSE)
Synchronous plot of coal production and health factors z score over time in the Armstrong county

Synchronous plot of coal production and health factors z score over time in the Armstrong county

# linear regression for health factors z score and coal production

armstrong.health.factors.regression <- lm(data = Armstrong.data, Health.Factors.Z.Score ~
    sum.coal.production)

summary(armstrong.health.factors.regression)
## 
## Call:
## lm(formula = Health.Factors.Z.Score ~ sum.coal.production, data = Armstrong.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16986 -0.02625 -0.02092  0.04709  0.22027 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          2.722e-01  8.230e-02   3.308  0.00911 **
## sum.coal.production -1.468e-08  1.293e-07  -0.113  0.91214   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1059 on 9 degrees of freedom
## Multiple R-squared:  0.001429,   Adjusted R-squared:  -0.1095 
## F-statistic: 0.01288 on 1 and 9 DF,  p-value: 0.9121
# comparing health outcomes z score in Armstrong county
var3 <- Armstrong.data$Health.Outcomes.Z.Score
data <- data.frame(x, var3, var2)

obj1 <- xyplot(var3 ~ Year, data, type = "l", lwd = 2, col = "steelblue", ylab = "Health Outcomes Z Score (Blue)",
    col.axis = "red")
obj2 <- xyplot(var2 ~ Year, data, type = "l", lwd = 2, col = "red", ylab = "Coal Produced in Tons (Red)")

doubleYScale(obj1, obj2, add.ylab2 = TRUE, use.style = FALSE)
Synchronous plot of coal production and health outcomes z score over time in the Armstrong county

Synchronous plot of coal production and health outcomes z score over time in the Armstrong county

# linear regression for health outcomes z score and coal production

armstrong.health.outcomes.regression <- lm(data = Armstrong.data, Health.Outcomes.Z.Score ~
    sum.coal.production)

summary(armstrong.health.outcomes.regression)
## 
## Call:
## lm(formula = Health.Outcomes.Z.Score ~ sum.coal.production, data = Armstrong.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28911 -0.20703 -0.06661  0.17183  0.53155 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          5.447e-01  2.149e-01   2.535    0.032 *
## sum.coal.production -3.444e-07  3.376e-07  -1.020    0.334  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2764 on 9 degrees of freedom
## Multiple R-squared:  0.1036, Adjusted R-squared:  0.004036 
## F-statistic: 1.041 on 1 and 9 DF,  p-value: 0.3343

Clearfield County

# comparing health factors z score with coal production in clearfield
# county
x <- data.frame(Year = c(2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018,
    2019, 2020))
Year <- Clearfield.data$year
var1 <- Clearfield.data$Health.Factors.Z.Score
var2 <- Clearfield.data$sum.coal.production
data <- data.frame(x, var1, var2)

obj1 <- xyplot(var1 ~ Year, data, type = "l", lwd = 2, col = "steelblue", ylab = "Health Factors Z Score (Blue)",
    col.axis = "red")
obj2 <- xyplot(var2 ~ Year, data, type = "l", lwd = 2, col = "red", ylab = "Coal Produced in Tons (Red)")

doubleYScale(obj1, obj2, add.ylab2 = TRUE, use.style = FALSE)
Synchronous plot of coal production and health factors z score over time in the Clearfield county

Synchronous plot of coal production and health factors z score over time in the Clearfield county

# linear regression for health factors z score and coal production

clearfield.health.factors.regression <- lm(data = Clearfield.data, Health.Factors.Z.Score ~
    sum.coal.production)

summary(clearfield.health.factors.regression)
## 
## Call:
## lm(formula = Health.Factors.Z.Score ~ sum.coal.production, data = Clearfield.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20576 -0.06066 -0.01191  0.07777  0.18406 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          3.675e-01  1.360e-01   2.702   0.0243 *
## sum.coal.production -9.712e-08  7.876e-08  -1.233   0.2488  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.125 on 9 degrees of freedom
## Multiple R-squared:  0.1445, Adjusted R-squared:  0.04947 
## F-statistic:  1.52 on 1 and 9 DF,  p-value: 0.2488
# comparing health outcomes z score in clearfield county
var3 <- Clearfield.data$Health.Outcomes.Z.Score
data <- data.frame(x, var3, var2)

obj1 <- xyplot(var3 ~ Year, data, type = "l", lwd = 2, col = "steelblue", ylab = "Health Outcomes Z Score (Blue)",
    col.axis = "red")
obj2 <- xyplot(var2 ~ Year, data, type = "l", lwd = 2, col = "red", ylab = "Coal Produced in Tons (Red)")

doubleYScale(obj1, obj2, add.ylab2 = TRUE, use.style = FALSE)
Synchronous plot of coal production and health outcomes z score over time in the Clearfield county

Synchronous plot of coal production and health outcomes z score over time in the Clearfield county

# linear regression for health outcomes z score and coal production

clearfield.health.outcomes.regression <- lm(data = Clearfield.data, Health.Outcomes.Z.Score ~
    sum.coal.production)

summary(clearfield.health.outcomes.regression)
## 
## Call:
## lm(formula = Health.Outcomes.Z.Score ~ sum.coal.production, data = Clearfield.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19422 -0.11952  0.04326  0.11181  0.15008 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)          1.237e-01  1.466e-01   0.844    0.421
## sum.coal.production -6.077e-09  8.487e-08  -0.072    0.944
## 
## Residual standard error: 0.1347 on 9 degrees of freedom
## Multiple R-squared:  0.0005693,  Adjusted R-squared:  -0.1105 
## F-statistic: 0.005126 on 1 and 9 DF,  p-value: 0.9445

The trends for these counties were much more inconsistent. None had p-values below or near 0.05. Much of this issue may be the result of only having 11 data points and so an increase in health data in the more distant past would be crucial to understanding how strong this correlation is. Sadly, much of the more robust data only began a more rigorous collection more recently and so this is difficult to find. Overall, the delayed nature of community health impacts from coal mining made conclusive findings through time series analysis difficult.

0.6.2 Multiple Linear Regression

Mixed Correlation Plots

# re-import the data

data <- read.csv("./ProcessedData/health.coal.acs.impute.csv")

# format the data to get values that you want for the regression and
# correlation analysis - ie no NAs, no counties with 0 white people
data.edit <- data %>%
    filter(perc.white != 0) %>%
    na.omit() %>%
    select(!X)

# remove county and other non-informative (and/or categorical) variables
# to permit correlation plot
acs.health.coal.cor.edit <- data.edit %>%
    select(!c(County, totpop, totpopwhite, bachdegree, Health.Factors.Z.Score,
        Health.Factors.Rank, year)) %>%
    na.omit()

# Run correlation plot
acs.health.coal.Corr <- cor(acs.health.coal.cor.edit)
corrplot(acs.health.coal.Corr, method = "circle", type = "upper")
Mixed correlation plot for all variables of interest. Blue values are positive and red values are negative. Correlations performed using Pearson’s correlation coefficient.

Mixed correlation plot for all variables of interest. Blue values are positive and red values are negative. Correlations performed using Pearson’s correlation coefficient.

corrplot.mixed(acs.health.coal.Corr, upper = "circle")
Mixed correlation plot for all variables of interest. Blue values are positive and red values are negative. Correlations performed using Pearson’s correlation coefficient.

Mixed correlation plot for all variables of interest. Blue values are positive and red values are negative. Correlations performed using Pearson’s correlation coefficient.

An initial plot of correlations between each of the variables of interest - median age, median income, percent white, percent with a bachelor’s degree, health outcomes Z score, health outcomes rank, sum of coal production over time and years from the peak coal production to 2019 - reveal a number of interesting trends. First we see that both median income and percent with bachelor’s degree are strongly negatively associated with health outcomes (coefficient = -0.62, -0.63 for income and rank and income and z-score, coefficient = -0.51 and -0.54 for percent bachelor’s degree and rank and percent bachelor’s degree and z-score). This clearly matches the prior findings that wealth and education have strong impacts on health. Looking specifically at our variables of greatest interest - total coal production and years from peak production - we see a number of interesting relationships. First, we see that health outcomes have positive relationships with total coal production (coefficient = 0.26 and 0.31, z-score and rank). This follows expected behavior, as increasing rank value for health outcome indicates worse health. We can then see that total coal production is strongly negatively associated with years from peak production (coefficient = -0.46). This suggests that increasing total coal production is associated with decreasing period from peak year of production until now. This is likely informed by our imputation of 37 years for all counties outside of the coal dataset, which would skew low coal counties towards larger periods from 2019. Total coal production was positively associated with median age and percent white, but negatively with income and education (coefficients = 0.36, 0.32, -0.35, -0.44 respectively). This fits with our understanding of coal as a product produced in rural, white, older, and now-poor areas. Years from peak production reverse the relationships with total coal - negative relationships with health outcomes, median age and percent white while positive relationships with bachelor’s degree and income (coefficients = -0.29, -0.38, -0.43, -0.30, 0.31, 0.26 respectively). This may be a result of actual data - increasing years from peak increases a given county’s income and education - but may also in part be an artifact of our imputation of 37 years for all counties without coal production.

Collectively, these results suggest that, while income and education increase community health, and age decreases health outcomes, total coal production and years from peak production do not have a particularly strong relationship with health outcomes.

The linear regression below confirm many of the findings from the correlation plot above. For our first model formulation, the adjusted R-squared is relatively high, suggesting strong fit of the model (Adj. R-squared = 0.7228, 29 degrees of freedom). Median age, median income and percentage white-identifying had significantly non-zero relationships with health outcomes (age: coeff.=0.01566, p<0.001, income: coeff.=-0.0001, p<0.001, race: coeff. = -0.05156, p=0.001). Our variables of interest - total coal production and years from peak production - did not have non-zero relationships (p=0.32 and 0.96 respectively).

Linear regression Model 1 Results

## 
## Call:
## lm(formula = Health.Outcomes.Z.Score ~ totpop + medage + medincome + 
##     perc.white + perc.bachdegree + year.from.peak.2019 + sum.coal.prod, 
##     data = data.edit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85760 -0.25237  0.03874  0.31684  0.85372 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.863e+00  1.563e+00   1.192 0.242856    
## totpop               6.028e-07  4.652e-07   1.296 0.205236    
## medage               1.566e-01  3.616e-02   4.331 0.000162 ***
## medincome           -1.193e-04  3.016e-05  -3.955 0.000452 ***
## perc.white          -5.156e-02  1.418e-02  -3.635 0.001066 ** 
## perc.bachdegree     -1.306e-02  4.879e-02  -0.268 0.790842    
## year.from.peak.2019 -8.422e-03  8.255e-03  -1.020 0.316067    
## sum.coal.prod        2.413e-10  5.372e-09   0.045 0.964478    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.475 on 29 degrees of freedom
## Multiple R-squared:  0.7767, Adjusted R-squared:  0.7228 
## F-statistic: 14.41 on 7 and 29 DF,  p-value: 6.448e-08

After running the Akaike Information Criteria test to evaluate model overfitting, we reduced the explanatory variables to only include the variables ‘percent white’, ‘median age’, and ‘median income’ (initial AIC = -48.1, final AIC = -52.6. Df = 1). This new model displayed significant results for median age (coeff. = 0.1716, p-value<0.0001), percent white (coeff. = -0.06446, p-value <0.0001), and median income (coeff. = -0.0001206, p-value <0.0001). The adjusted r-squared was 0.7322.

AIC Results

## Start:  AIC=-48.1
## Health.Outcomes.Z.Score ~ totpop + medage + medincome + perc.white + 
##     perc.bachdegree + year.from.peak.2019 + sum.coal.prod
## 
##                       Df Sum of Sq     RSS     AIC
## - sum.coal.prod        1    0.0005  6.5443 -50.096
## - perc.bachdegree      1    0.0162  6.5601 -50.007
## - year.from.peak.2019  1    0.2349  6.7788 -48.794
## <none>                              6.5439 -48.098
## - totpop               1    0.3789  6.9228 -48.015
## - perc.white           1    2.9821  9.5260 -36.205
## - medincome            1    3.5297 10.0736 -34.137
## - medage               1    4.2318 10.7757 -31.644
## 
## Step:  AIC=-50.1
## Health.Outcomes.Z.Score ~ totpop + medage + medincome + perc.white + 
##     perc.bachdegree + year.from.peak.2019
## 
##                       Df Sum of Sq     RSS     AIC
## - perc.bachdegree      1    0.0182  6.5625 -51.993
## - year.from.peak.2019  1    0.2710  6.8154 -50.594
## <none>                              6.5443 -50.096
## - totpop               1    0.3948  6.9391 -49.928
## - perc.white           1    3.0693  9.6137 -37.866
## - medincome            1    3.5294 10.0738 -36.136
## - medage               1    4.2439 10.7882 -33.601
## 
## Step:  AIC=-51.99
## Health.Outcomes.Z.Score ~ totpop + medage + medincome + perc.white + 
##     year.from.peak.2019
## 
##                       Df Sum of Sq     RSS     AIC
## - year.from.peak.2019  1    0.2700  6.8325 -52.501
## <none>                              6.5625 -51.993
## - totpop               1    0.3768  6.9394 -51.927
## - perc.white           1    3.1791  9.7417 -39.377
## - medage               1    4.7691 11.3316 -33.783
## - medincome            1   11.8550 18.4175 -15.812
## 
## Step:  AIC=-52.5
## Health.Outcomes.Z.Score ~ totpop + medage + medincome + perc.white
## 
##              Df Sum of Sq     RSS     AIC
## - totpop      1    0.3606  7.1932 -52.598
## <none>                     6.8325 -52.501
## - perc.white  1    3.2138 10.0463 -40.237
## - medage      1    6.2403 13.0728 -30.494
## - medincome   1   13.0509 19.8834 -14.978
## 
## Step:  AIC=-52.6
## Health.Outcomes.Z.Score ~ medage + medincome + perc.white
## 
##              Df Sum of Sq     RSS     AIC
## <none>                     7.1932 -52.598
## - medage      1    6.2043 13.3975 -31.586
## - perc.white  1    9.9889 17.1821 -22.381
## - medincome   1   13.8646 21.0578 -14.855
## 
## Call:
## lm(formula = Health.Outcomes.Z.Score ~ medage + medincome + perc.white, 
##     data = data.edit)
## 
## Coefficients:
## (Intercept)       medage    medincome   perc.white  
##   2.2165631    0.1716122   -0.0001206   -0.0646643

Linear Regression Model 2 Results

## 
## Call:
## lm(formula = Health.Outcomes.Z.Score ~ medage + perc.white + 
##     medincome, data = data.edit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91845 -0.26826 -0.03308  0.26937  0.86663 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.217e+00  1.219e+00   1.819   0.0781 .  
## medage       1.716e-01  3.217e-02   5.335 6.86e-06 ***
## perc.white  -6.466e-02  9.552e-03  -6.769 1.02e-07 ***
## medincome   -1.206e-04  1.512e-05  -7.975 3.36e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4669 on 33 degrees of freedom
## Multiple R-squared:  0.7545, Adjusted R-squared:  0.7322 
## F-statistic: 33.81 on 3 and 33 DF,  p-value: 3.534e-10

These results show that, while total population and years from peak coal production have little statistical relationship with a county’s health outcomes, increasing age decreases a county’s health and increasing whiteness and income both increase a county’s health.

Unfortunately the assumptions of a multiple linear regression were not met during this analysis, bringing the above results into question. We can see from the residuals vs. fitted and scale location plots clear evidence of heteroskedasticity (non equal variance in residuals over the range of the model) and in the Normal Q-Q plot some indication of a lack of normality of the joint distribution particularly near extreme values. We can see these outliers more clearly in the Residuals vs. Leverage plot - particularly the outlier value ‘9’ which passes Cook’s Distance limits at both the 0.5 and the 1 levels. We can, however, note that the relationships were broadly linear, looking at the scatterplot below. These issues may have come in part from a low number of observations. Due to the use of three different datasets, each which covered different counties within Pennsylvania, there were only 37 counties which had full datasets and therefore could be analyzed by this regression. This is a relatively small sample size.

Plots: residual vs. fitted, scale location, Normal Q-Q, Residuals vs. Leverage
Plots to examine multiple linear regression assumptions (homoscedasticity, normality)

Plots to examine multiple linear regression assumptions (homoscedasticity, normality)

Cowplot of explanatory vs response variables

health.age <- ggplot(data.edit, aes(medage, Health.Outcomes.Z.Score)) + geom_point() +
    ylab("Health Outcomes") + xlab("Median Age") + theme(axis.text.x = element_text(angle = 90)) +
    theme(legend.position = "none") + theme_classic()

health.race <- ggplot(data.edit, aes(perc.white, Health.Outcomes.Z.Score)) +
    geom_point() + ylab("Health Outcomes") + xlab("% White-Identifying") + theme(axis.text.x = element_text(angle = 90)) +
    theme(legend.position = "none") + theme_classic()

health.inc <- ggplot(data.edit, aes(medincome, Health.Outcomes.Z.Score)) + geom_point() +
    ylab("Health Outcomes") + xlab("Median Income (USD)") + theme(axis.text.x = element_text(angle = 90)) +
    theme(legend.position = "none") + theme_classic()

ggarrange(health.age, health.race, health.inc, align = "hv", labels = c("A",
    "B", "C"), nrows = 3)
Plots to examine linearity assumption of multiple linear regression model.

Plots to examine linearity assumption of multiple linear regression model.

0.7 Summary and Conclusions

This analysis reveals a number of broad takeaways. The most significant is that, despite an apparent association between coal production and health outcomes from a visual analysis of the spatial distribution of each of those variables, total coal production and the period of time from peak coal production did not explain health outcomes to any significant extent. Instead, median age was the clearest predictor of poor health outcomes, and income and education ensured good health outcomes. While these findings fit decades of research on the causes of community health or illness, they do not align with our initial hypothesis concerning the negative impact of coal mining on community health. And while the time series statistical analysis of the county of Schuylkill suggested a relationship between coal production and poor health outcomes, limited data and the wide variability outside of that county suggests no clear link between these variables over time.

There are a number of reasons that we may have observed this unexpected result. The first is spatial scale. Coal mining may have strong negative impacts on localized areas - areas the size of a census block, for instance - but not on larger areas. That would mean that our choice of county-level granularity would have smoothed out any influence of coal mining on human health. The second is the long causal chain to health outcomes that we had neither the time nor resources to explore. Countless factors besides the socioeconomic variables we selected and coal production itself likely have impacts on health outcomes. For example, contamination of ground water through acidic main drainage is one probable causal linkage between coal production and declines in community health. When coal is exposed through mining, water permeating the mine runs across exposed coal which includes deposits of pyrite and other heavy metals. The pyrite is able to dissolve as sulfuric acid, increasing the acidity of the water and therein the weathering on the coal surface. This results in highly acidic and toxic mine runoff that, if not successfully treated, can infiltrate into the groundwater. Our own preliminary analysis of mine runoff data supports this possible causal linkage.

Sulfate Water Quality Plot

Only.Sulfate <- filter(WaterQualitySlim, Contaminant == "Sulfate as SO4")

Sulfate.Over.Time <- ggplot(Only.Sulfate, aes(Date, Measure, group = Contaminant)) +
    geom_point() + ylab("Sulfate as SO4 Concentration (mg/l)")

print(Sulfate.Over.Time)
SO4 Levels Seen in Acid Mine Drainage in Pennsylvania

SO4 Levels Seen in Acid Mine Drainage in Pennsylvania

Average water usually has SO4 levels of 5 to 50 mg/L (History of Sulfate Use). Certain sampling locations did fall within that range, but many are far higher, some reaching levels over 500. This is harmful in and of itself but it also increases the acidity of the water and therein increases the solubility of other more toxic heavy metals (RoyChowhurdy et al., 2015). This can lead to exceptionally high levels of widespread contamination and can impact groundwater. We sadly could not find sufficient data in this area to fully investigate this thread, but this preliminary finding does indicate that water quality is a possible point of connection.

The nature of this causal linkage would likely vary by community. Some communities may use groundwater for drinking wells while others may not. In this way, communities which mined coal but not used groundwater wells - and therefore not felt the effects of mine drainage contamination - may have obfuscated the impact of coal mining on communities which do use groundwater extensively in the course of our current analysis. By not including explanatory variables capturing causal linkages such as this one, we once again potentially paper over the influence of coal production on community health.

0.8 References

Rose M. Mueller, Surface coal mining and public health disparities: Evidence from Appalachia, Resources Policy,Volume 76, 2022, 102567. https://doi.org/10.1016/j.resourpol.2022.102567.

Gasparotto, J. and K. Da Boit Martinello (2021). “Coal as an energy source and its impacts on human health.” Energy Geoscience 2(2): 113-120.

G. Leon-Mejia, M. Quintana, R.Debastiani, J. Dias, L. Espitia-Perez, A. Hartmann, J. Da Silva“Genetic damage in coal miners evaluated by buccal micronucleus cytome assay” Ecotoxicol.Environ. Saf., 107 (2014), pp. 133-139, 10.1016/j.ecoenv.2014.05.023

Marius Campbell, Coal fields of the United States. Department of the Interior, U.S. Geological Survey, February 24, 1917. https://pubs.usgs.gov/pp/0100a/report.pdf

Pennsylvania Department of Environmental Quality. Coal mining in Pennsylvania. PA Mining history. https://www.dep.pa.gov/Business/Land/Mining/Pages/PA-Mining-History.aspx. Last accessed on November 8, 2022)

RoyChowdhury, A., Sarkar, D. & Datta, R. Remediation of Acid Mine Drainage-Impacted Water. Curr Pollution Rep 1, 131–141 (2015). https://doi.org/10.1007/s40726-015-0011-3

History of Sulfate Use. Master water conditioning corporation. (n.d.). Retrieved December 13, 2022, from https://www.masterwater.com/